Predicting Interest Rates

Introduction

Both Blinda and I had a fascination with economics data and were particularly interested in finding ways to apply the techniques we have learned in data mining to the field of economics. In many cases, “economics” as an entire field seems way too daunting to attack as a whole. However, in some sense, that seems like a perfect reason to pick it for a data mining project! We could download a massive amount of rather tangentially related data, and find patterns or facts that we find interesting. In the end, our goal was to hopefully build some sort of algorithm or model that backs up the intuition that we have created learning Economics while at Columbia. Of course much of economics comes down to really cool new techniques for finding casual inferences like difference-in-difference and regression discontinuities, but we think some really interesting conclusions should be available with other tools.

As such our audience for this paper and topics that will be covered in the coming pages should be completely accessible to undergraduates and anyone that has even a very basic understanding of economics terms. While some of the features in the data may not be clear on their own, if they have significant or interesting results they will be explained. For example a term like M1 money. This is just economists making up a fancy term for immediately available money: money that is in checking accounts pretty much and available through debit cards or ATMS.

Datasets!

  • FRED
  • DBnomics
  • Census

Wrangle the data:

Dataset

First let’s see what the most popular sets of data are!

popular_funds_series <- fredr_series_search_text(
    search_text = "federal funds",
    order_by = "popularity",
    sort_order = "desc",
    limit = 5
)

popular_funds_series_id <- popular_funds_series$id

popular_funds_series_id
## [1] "M2"       "FEDFUNDS" "M2V"      "M2SL"     "DFF"
wanted_funds_series <- c("DGS10", "T10YIE", "CPIAUCSL", "GOLDAMGBD228NLBM")
df$dates <- as.Date(df$dates) 

df <- as_tibble(df)

dim(df)
## [1] 16270    28

Data Summarization

The data as a whole is a beautiful combination of the most downloaded data sources and a variety of different gauges of the market and the United States as a whole.

We initially began by graphing a variety of sources to see how they compare. At first, I thought something was wrong with our data. Below is a graph of Federal Surplus or Deficient (FYFSD) and the US real gross domestic product (GDP). No matter what we graphed against our deficit, everything else appeared to be flat. In reality, this just goes to show the shear size of the deficit the United States is running.

map_dfr(c("FYFSD", "GDPC1"), fredr) %>%
  ggplot(data = ., mapping = aes(x = date, y = value, color = series_id)) +
    geom_line() +
    labs(x = "Observation Date", y = "Rate", color = "Series")

When everything else is compared to each other, they seem more more reasonable. For example, this shows the Unemployment rate and the federal funds rate.

map_dfr(c("UNRATE", "FEDFUNDS"), fredr) %>%
  ggplot(data = ., mapping = aes(x = date, y = value, color = series_id)) +
    geom_line() +
    labs(x = "Observation Date", y = "Rate", color = "Series")

Fun visualization to compare GDP to GDPC1, or Real GDP. Real GDP is inflation adjusted GDP, meaning it shows the production output when we account for money slowly losing value.

map_dfr(c("GDP", "GDPC1"), fredr) %>%
  ggplot(data = ., mapping = aes(x = date, y = value, color = series_id)) +
    geom_line() +
    labs(x = "Observation Date", y = "Rate", color = "Series")

Some of the results are interesting to see if they match our intuition. For example, personal savings rate and vehicle sales!

map_dfr(c("TOTALSA", "PSAVERT"), fredr) %>%
  ggplot(data = ., mapping = aes(x = date, y = value, color = series_id)) +
    geom_line() +
    labs(x = "Observation Date", y = "Rate", color = "Series")

It seems much further off than I expected. Vehicle sales seem almost constant (in teal) with the exception of recessions, before immediately coming back. Personal savings however have seen a steady decrease regardless, although the stimulus appears to have had an amazing effect!

modern_df <- filter(df, dates >= "1980-01-01")
modern_df
## # A tibble: 10,928 x 28
##    dates      consumer_price_index_urban inflation_expectation ten_year_treasury
##    <date>                          <dbl>                 <dbl>             <dbl>
##  1 1980-01-01                         78                  10.4              NA  
##  2 1980-01-02                         NA                  NA                10.5
##  3 1980-01-03                         NA                  NA                10.6
##  4 1980-01-04                         NA                  NA                10.7
##  5 1980-01-07                         NA                  NA                10.6
##  6 1980-01-08                         NA                  NA                10.6
##  7 1980-01-09                         NA                  NA                10.6
##  8 1980-01-10                         NA                  NA                10.5
##  9 1980-01-11                         NA                  NA                10.7
## 10 1980-01-14                         NA                  NA                10.7
## # … with 10,918 more rows, and 24 more variables: ten_minus_two_treasury <dbl>,
## #   ten_minus_three_months_treasury <dbl>, unemployment_rate <dbl>,
## #   real_gdp <dbl>, eff_fed_funds_rate <dbl>, gdp <dbl>,
## #   thirty_year_fixed_mortgage <dbl>, velocity_of_mtwo <dbl>, m_one <dbl>,
## #   m_two <dbl>, all_employees_minus_farmers <dbl>,
## #   personal_savings_rate <dbl>, aaa_corporate_bond_yield <dbl>,
## #   personal_consumption_expenditures <dbl>, industrial_production_index <dbl>,
## #   federal_debt_total <dbl>, labor_force_participation_rate <dbl>,
## #   consumer_price_index_nationwide <dbl>, s_and_p <dbl>, m_three <dbl>,
## #   federal_surplus_or_deficit <dbl>, house_price_index_nationwide <dbl>,
## #   total_vehicle_sales <dbl>, federal_debt_percent_of_gdp <dbl>

Problems !

One of the immediate problems with the data was the somewhat random in when it was gathered. Some daily, some monthly, and some seemingly randomly. Others were done by quarter (once every three months), but of course they aren’t always released on the first of the month and because we have merged the data daily, it makes it really hard to run any algorithms on. So instead, we shall create a few very clean data frames that are averages. So we created two data frames that were grouped by month and year, averaging the values and ignoring missing values.

df_monthly <- df %>%
  group_by(month = floor_date(dates, "month")) %>%
  summarise(
    consumer_price_index_urban = mean(consumer_price_index_urban, na.rm = TRUE),
    inflation_expectation = mean(inflation_expectation, na.rm = TRUE),
    ten_year_treasury = mean(ten_year_treasury, na.rm = TRUE),
    ten_minus_two_treasury = mean(ten_minus_two_treasury, na.rm = TRUE),
    ten_minus_three_months_treasury = mean(ten_minus_three_months_treasury, na.rm = TRUE),
    unemployment_rate = mean(unemployment_rate, na.rm = TRUE),
    real_gdp = mean(real_gdp, na.rm = TRUE),
    eff_fed_funds_rate = mean(eff_fed_funds_rate, na.rm = TRUE),
    gdp = mean(gdp, na.rm = TRUE),
    thirty_year_fixed_mortgage = mean(thirty_year_fixed_mortgage, na.rm = TRUE),
    m_two = mean(m_two, na.rm = TRUE),
    velocity_of_mtwo = mean(velocity_of_mtwo, na.rm = TRUE),
    m_one = mean(m_one, na.rm = TRUE),
    all_employees_minus_farmers = mean(all_employees_minus_farmers, na.rm = TRUE),
    aaa_corporate_bond_yield = mean(aaa_corporate_bond_yield, na.rm = TRUE),
    personal_savings_rate = mean(personal_savings_rate, na.rm = TRUE),
    personal_consumption_expenditures = mean(personal_consumption_expenditures, na.rm = TRUE),
    industrial_production_index = mean(industrial_production_index, na.rm = TRUE),
    federal_debt_total = mean(federal_debt_total, na.rm = TRUE),
    labor_force_participation_rate = mean(labor_force_participation_rate, na.rm = TRUE),
    consumer_price_index_nationwide = mean(consumer_price_index_nationwide, na.rm = TRUE),
    s_and_p = mean(s_and_p, na.rm = TRUE),
    m_three = mean(m_three, na.rm = TRUE),
    federal_surplus_or_deficit = mean(federal_surplus_or_deficit, na.rm = TRUE),
    house_price_index_nationwide = mean(house_price_index_nationwide, na.rm = TRUE),
    total_vehicle_sales = mean(total_vehicle_sales, na.rm = TRUE),
    federal_debt_percent_of_gdp = mean(federal_debt_percent_of_gdp, na.rm = TRUE)
)
df_yearly <- df %>%
  group_by(year = floor_date(dates, "year")) %>%
  summarise(
    consumer_price_index_urban = mean(consumer_price_index_urban, na.rm = TRUE),
    inflation_expectation = mean(inflation_expectation, na.rm = TRUE),
    ten_year_treasury = mean(ten_year_treasury, na.rm = TRUE),
    ten_minus_two_treasury = mean(ten_minus_two_treasury, na.rm = TRUE),
    ten_minus_three_months_treasury = mean(ten_minus_three_months_treasury, na.rm = TRUE),
    unemployment_rate = mean(unemployment_rate, na.rm = TRUE),
    real_gdp = mean(real_gdp, na.rm = TRUE),
    eff_fed_funds_rate = mean(eff_fed_funds_rate, na.rm = TRUE),
    gdp = mean(gdp, na.rm = TRUE),
    thirty_year_fixed_mortgage = mean(thirty_year_fixed_mortgage, na.rm = TRUE),
    m_two = mean(m_two, na.rm = TRUE),
    velocity_of_mtwo = mean(velocity_of_mtwo, na.rm = TRUE),
    m_one = mean(m_one, na.rm = TRUE),
    all_employees_minus_farmers = mean(all_employees_minus_farmers, na.rm = TRUE),
    aaa_corporate_bond_yield = mean(aaa_corporate_bond_yield, na.rm = TRUE),
    personal_savings_rate = mean(personal_savings_rate, na.rm = TRUE),
    personal_consumption_expenditures = mean(personal_consumption_expenditures, na.rm = TRUE),
    industrial_production_index = mean(industrial_production_index, na.rm = TRUE),
    federal_debt_total = mean(federal_debt_total, na.rm = TRUE),
    labor_force_participation_rate = mean(labor_force_participation_rate, na.rm = TRUE),
    consumer_price_index_nationwide = mean(consumer_price_index_nationwide, na.rm = TRUE),
    s_and_p = mean(s_and_p, na.rm = TRUE),
    m_three = mean(m_three, na.rm = TRUE),
    federal_surplus_or_deficit = mean(federal_surplus_or_deficit, na.rm = TRUE),
    house_price_index_nationwide = mean(house_price_index_nationwide, na.rm = TRUE),
    total_vehicle_sales = mean(total_vehicle_sales, na.rm = TRUE),
    federal_debt_percent_of_gdp = mean(federal_debt_percent_of_gdp, na.rm = TRUE)
)

df_yearly_cleaned <- na.omit(df_yearly)

df_yearly_cleaned
## # A tibble: 34 x 28
##    year       consumer_price_index_urban inflation_expectation ten_year_treasury
##    <date>                          <dbl>                 <dbl>             <dbl>
##  1 1987-01-01                       114.                  3.13              8.39
##  2 1988-01-01                       118.                  3.68              8.85
##  3 1989-01-01                       124.                  3.8               8.49
##  4 1990-01-01                       131.                  4.13              8.55
##  5 1991-01-01                       136.                  3.19              7.86
##  6 1992-01-01                       140.                  2.82              7.01
##  7 1993-01-01                       144.                  3.08              5.87
##  8 1994-01-01                       148.                  3                 7.09
##  9 1995-01-01                       152.                  2.95              6.57
## 10 1996-01-01                       157.                  2.98              6.44
## # … with 24 more rows, and 24 more variables: ten_minus_two_treasury <dbl>,
## #   ten_minus_three_months_treasury <dbl>, unemployment_rate <dbl>,
## #   real_gdp <dbl>, eff_fed_funds_rate <dbl>, gdp <dbl>,
## #   thirty_year_fixed_mortgage <dbl>, m_two <dbl>, velocity_of_mtwo <dbl>,
## #   m_one <dbl>, all_employees_minus_farmers <dbl>,
## #   aaa_corporate_bond_yield <dbl>, personal_savings_rate <dbl>,
## #   personal_consumption_expenditures <dbl>, industrial_production_index <dbl>,
## #   federal_debt_total <dbl>, labor_force_participation_rate <dbl>,
## #   consumer_price_index_nationwide <dbl>, s_and_p <dbl>, m_three <dbl>,
## #   federal_surplus_or_deficit <dbl>, house_price_index_nationwide <dbl>,
## #   total_vehicle_sales <dbl>, federal_debt_percent_of_gdp <dbl>

In fact, to show a point of how inconcistent some of the data was, simply cleaning the data by pulling only months where all data series had at least an entry yields absolutely no months. So instead, for the monthly approach, we decided to take a different approach. Instead, we started eliminating poor data features, ones that didn’t line up enough with the majority of the data we had.

Explore

df_monthly_modern <- df_monthly %>% 
  filter(month >= "1987-01-01")

monthly_nas <- df_monthly_modern %>% 
  summarise_all(funs(sum(is.na(.))))
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
monthly_nas
## # A tibble: 1 x 28
##   month consumer_price_ind… inflation_expect… ten_year_treasu… ten_minus_two_tr…
##   <int>               <int>             <int>            <int>             <int>
## 1     0                   2                 2                0                 0
## # … with 23 more variables: ten_minus_three_months_treasury <int>,
## #   unemployment_rate <int>, real_gdp <int>, eff_fed_funds_rate <int>,
## #   gdp <int>, thirty_year_fixed_mortgage <int>, m_two <int>,
## #   velocity_of_mtwo <int>, m_one <int>, all_employees_minus_farmers <int>,
## #   aaa_corporate_bond_yield <int>, personal_savings_rate <int>,
## #   personal_consumption_expenditures <int>, industrial_production_index <int>,
## #   federal_debt_total <int>, labor_force_participation_rate <int>,
## #   consumer_price_index_nationwide <int>, s_and_p <int>, m_three <int>,
## #   federal_surplus_or_deficit <int>, house_price_index_nationwide <int>,
## #   total_vehicle_sales <int>, federal_debt_percent_of_gdp <int>

This yielded an interesting finding. In fact, there appear to be two types of data here. Data collected quarterly and data collected monthly (or even more granularly). Then there are exceptions like total federal debt which is collected once each year in september. So in this way, the data was split into 2 groups.

data_monthly <- df_monthly %>% 
  select(-real_gdp, -gdp, -velocity_of_mtwo, -federal_debt_total, -federal_surplus_or_deficit, -house_price_index_nationwide, -house_price_index_nationwide, -federal_debt_percent_of_gdp) %>% 
  na.omit()
  
data_monthly
## # A tibble: 409 x 21
##    month      consumer_price_index_urban inflation_expectation ten_year_treasury
##    <date>                          <dbl>                 <dbl>             <dbl>
##  1 1987-01-01                       111.                   2.9              7.08
##  2 1987-02-01                       112.                   3.1              7.25
##  3 1987-03-01                       112.                   3                7.25
##  4 1987-04-01                       113.                   3                8.02
##  5 1987-05-01                       113                    3.4              8.61
##  6 1987-06-01                       114.                   3.3              8.40
##  7 1987-07-01                       114.                   3.1              8.45
##  8 1987-08-01                       114.                   3.2              8.76
##  9 1987-09-01                       115.                   3                9.42
## 10 1987-10-01                       115                    3.3              9.52
## # … with 399 more rows, and 17 more variables: ten_minus_two_treasury <dbl>,
## #   ten_minus_three_months_treasury <dbl>, unemployment_rate <dbl>,
## #   eff_fed_funds_rate <dbl>, thirty_year_fixed_mortgage <dbl>, m_two <dbl>,
## #   m_one <dbl>, all_employees_minus_farmers <dbl>,
## #   aaa_corporate_bond_yield <dbl>, personal_savings_rate <dbl>,
## #   personal_consumption_expenditures <dbl>, industrial_production_index <dbl>,
## #   labor_force_participation_rate <dbl>,
## #   consumer_price_index_nationwide <dbl>, s_and_p <dbl>, m_three <dbl>,
## #   total_vehicle_sales <dbl>
data_quarterly <- df_monthly %>% 
  select(-federal_debt_total, - federal_debt_percent_of_gdp, -federal_surplus_or_deficit) %>% 
  na.omit()

data_quarterly
## # A tibble: 136 x 25
##    month      consumer_price_index_urban inflation_expectation ten_year_treasury
##    <date>                          <dbl>                 <dbl>             <dbl>
##  1 1987-01-01                       111.                   2.9              7.08
##  2 1987-04-01                       113.                   3                8.02
##  3 1987-07-01                       114.                   3.1              8.45
##  4 1987-10-01                       115                    3.3              9.52
##  5 1988-01-01                       116                    3.2              8.67
##  6 1988-04-01                       117.                   3.3              8.72
##  7 1988-07-01                       118.                   4.6              9.06
##  8 1988-10-01                       120.                   3.9              8.80
##  9 1989-01-01                       121.                   3.5              9.09
## 10 1989-04-01                       123.                   4.3              9.18
## # … with 126 more rows, and 21 more variables: ten_minus_two_treasury <dbl>,
## #   ten_minus_three_months_treasury <dbl>, unemployment_rate <dbl>,
## #   real_gdp <dbl>, eff_fed_funds_rate <dbl>, gdp <dbl>,
## #   thirty_year_fixed_mortgage <dbl>, m_two <dbl>, velocity_of_mtwo <dbl>,
## #   m_one <dbl>, all_employees_minus_farmers <dbl>,
## #   aaa_corporate_bond_yield <dbl>, personal_savings_rate <dbl>,
## #   personal_consumption_expenditures <dbl>, industrial_production_index <dbl>,
## #   labor_force_participation_rate <dbl>,
## #   consumer_price_index_nationwide <dbl>, s_and_p <dbl>, m_three <dbl>,
## #   house_price_index_nationwide <dbl>, total_vehicle_sales <dbl>

For intuition, I decided to run regression on the yearly dataset, as it has a few additional features that we lose when picking quarterly or monthly, so I thought it couldn’t hurt to try. As for what to predict, I thought the most interesting would actually be the inflation expectation. This is a metrix calculated by Michigan State which says what do American’s think the interest rate is going to do in the immediate future. In a sense, this allows us to approximate what the effects of all the other features are on Americans.

linear_model_yearly <- lm(data = df_yearly_cleaned, inflation_expectation ~ .)

summary(linear_model_yearly)
## 
## Call:
## lm(formula = inflation_expectation ~ ., data = df_yearly_cleaned)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.145563 -0.040709  0.000795  0.044900  0.170767 
## 
## Coefficients: (1 not defined because of singularities)
##                                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                       -2.569e+01  2.325e+01  -1.105   0.3057  
## year                               2.024e-03  1.834e-03   1.104   0.3062  
## consumer_price_index_urban         4.908e-03  1.093e-01   0.045   0.9654  
## ten_year_treasury                 -2.544e-01  7.424e-01  -0.343   0.7419  
## ten_minus_two_treasury            -4.532e-01  3.484e-01  -1.301   0.2345  
## ten_minus_three_months_treasury    5.473e-02  8.207e-01   0.067   0.9487  
## unemployment_rate                 -1.392e+00  5.168e-01  -2.693   0.0309 *
## real_gdp                          -1.147e-03  2.037e-03  -0.563   0.5909  
## eff_fed_funds_rate                -2.164e-01  6.187e-01  -0.350   0.7368  
## gdp                                4.179e-04  1.804e-03   0.232   0.8235  
## thirty_year_fixed_mortgage        -3.358e-01  5.074e-01  -0.662   0.5293  
## m_two                             -3.400e-03  1.405e-03  -2.420   0.0461 *
## velocity_of_mtwo                  -9.242e+00  4.372e+00  -2.114   0.0724 .
## m_one                              1.339e-03  5.590e-04   2.395   0.0478 *
## all_employees_minus_farmers       -4.551e-04  2.930e-04  -1.553   0.1643  
## aaa_corporate_bond_yield           8.332e-01  5.072e-01   1.643   0.1444  
## personal_savings_rate             -3.605e-02  8.213e-02  -0.439   0.6740  
## personal_consumption_expenditures  4.257e-03  2.079e-03   2.048   0.0798 .
## industrial_production_index       -1.421e-04  5.567e-02  -0.003   0.9980  
## federal_debt_total                 7.994e-08  7.471e-07   0.107   0.9178  
## labor_force_participation_rate     1.538e+00  6.437e-01   2.389   0.0483 *
## consumer_price_index_nationwide    1.717e+00  7.822e-01   2.195   0.0642 .
## s_and_p                            1.171e-01  7.062e-02   1.658   0.1412  
## m_three                                   NA         NA      NA       NA  
## federal_surplus_or_deficit        -3.025e-07  8.374e-07  -0.361   0.7285  
## house_price_index_nationwide      -9.394e-02  4.795e-02  -1.959   0.0909 .
## total_vehicle_sales               -2.854e-01  1.199e-01  -2.380   0.0489 *
## federal_debt_percent_of_gdp       -3.299e-02  1.141e-01  -0.289   0.7808  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.156 on 7 degrees of freedom
## Multiple R-squared:  0.9743, Adjusted R-squared:  0.8789 
## F-statistic: 10.21 on 26 and 7 DF,  p-value: 0.00199
plot(df_yearly_cleaned[1:10, 1:10])

Unsurprisingly, there is not a ton of significance throughout, and there is likely high colinearity between many of these. However, it is interesting to look at what was pulled through so far. Of the features that had a significance code, total vehicle sales and M2 money supply seem to be the most impactful. M2 money is a count of all money including cash and checking deposits (M1) as well as money saved in savings accounts, money market securities, mutual funds and other time deposits. In a sense, it is a good gauge of how much physical money Americans have that they are able to spend.

df_yearly_minus_date <- df_yearly_cleaned %>% select(-year)

correlation_df <- round(cor(df_yearly_minus_date),2)

correlation_df_melt <- melt(correlation_df)
gz <- ggplot(correlation_df_melt, mapping = aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  theme(text = element_text(size = 8)) + 
  ggtitle("Heat map for correlation") + 
  ylab("")+
  xlab("")+
  scale_fill_distiller(palette = "RdPu")
ggplotly(gz, tooltip = "text")

From an even very brief glance, we can see quite a few of our features of very correlated. In a sense, this should be expected in a finance market with everything working together. For example, the Effective federal funds rate is designed to regulate many of the other features we have in our dataset, so it seems reasonable they are correlated. Also, some features are derivatives of others such as federal deficit and federal debt.

findCorrelation(
  cor(df_yearly_minus_date),
  cutoff = 0.99,
  verbose = FALSE,
  names = FALSE
)
## [1] 23 11  9 17  1  3 25
ggcorrplot(correlation_df, hc.order = TRUE, type = "lower")

Although the text is not readable (for now), the correlation is better shown when only highly correlated values are left in.

Of course, now the immediate next step is likely to try Ridge, Lasso and Elasticnet. These should “fix”, at least to some extent the problem of multicolinearity.

Working with monthly data

To get a more granular approach which is more appropriate, the other techniques will be tried on the monthly data instead of the yearly.

Train and Test

Note: It may be worth while to bootstrap or upsample here.

The data will be split into train and test datasets. The model will be trained on the train data and evaluated on the test data.

data_monthly_dateless <- data_monthly %>% 
  select(-month)

set.seed(100) 


index = sample(1:nrow(data_monthly_dateless), 0.7*nrow(data_monthly_dateless)) 

train = data_monthly_dateless[index,] # Create the training data 
test = data_monthly_dateless[-index,] # Create the test data
dim(train)
## [1] 286  20
dim(test)
## [1] 123  20
cols <-  colnames(data_monthly_dateless) 

pre_proc_val <- preProcess(train[,cols], method = c("center", "scale"))
train[,cols] = predict(pre_proc_val, train[,cols])
test[,cols] = predict(pre_proc_val, test[,cols])

Scaled Linear Model

linear_scaled <- lm(inflation_expectation ~ . , data = train)
summary(linear_scaled)
## 
## Call:
## lm(formula = inflation_expectation ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.10065 -0.34969 -0.02482  0.24613  2.23488 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        1.774e-12  3.469e-02   0.000 1.000000    
## consumer_price_index_urban         6.328e+00  7.136e-01   8.868  < 2e-16 ***
## ten_year_treasury                 -1.860e+00  4.491e-01  -4.142 4.63e-05 ***
## ten_minus_two_treasury            -1.917e-01  1.565e-01  -1.225 0.221646    
## ten_minus_three_months_treasury   -1.567e-02  2.808e-01  -0.056 0.955535    
## unemployment_rate                 -9.710e-01  1.317e-01  -7.371 2.15e-12 ***
## eff_fed_funds_rate                -1.243e-01  5.157e-01  -0.241 0.809798    
## thirty_year_fixed_mortgage         1.921e+00  4.961e-01   3.873 0.000136 ***
## m_two                             -3.868e+04  2.981e+04  -1.297 0.195636    
## m_one                              2.238e-01  1.179e-01   1.897 0.058867 .  
## all_employees_minus_farmers       -5.015e+00  6.042e-01  -8.301 5.24e-15 ***
## aaa_corporate_bond_yield           1.398e+00  4.750e-01   2.943 0.003537 ** 
## personal_savings_rate              7.948e-02  1.188e-01   0.669 0.504036    
## personal_consumption_expenditures  3.995e-01  1.053e+00   0.379 0.704626    
## industrial_production_index        8.192e-01  3.297e-01   2.485 0.013579 *  
## labor_force_participation_rate     3.401e-01  2.694e-01   1.262 0.207919    
## consumer_price_index_nationwide    3.069e-01  3.847e-02   7.978 4.45e-14 ***
## s_and_p                            3.955e-01  1.639e-01   2.414 0.016460 *  
## m_three                            3.868e+04  2.981e+04   1.297 0.195662    
## total_vehicle_sales               -4.769e-02  8.302e-02  -0.574 0.566146    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5866 on 266 degrees of freedom
## Multiple R-squared:  0.6788, Adjusted R-squared:  0.6559 
## F-statistic: 29.59 on 19 and 266 DF,  p-value: < 2.2e-16

These are more exciting results! Using the scaled monthly data, we have some very significant results with non-trivial estimates. Things that we think effect our inflation expectation: how the cost of a basket of goods changes nationwide, the ten year treasury note, the unemployment rate, the 30 fixed mortgage rate, number of employees in the US. As would be expected, the bonds, the unemployment rate and the number of employees are all inversely correlated. We would assume prices won’t be forced up when there are still people that want work, or when the bond yields are really high.

As for the linear model itself, it is not terrific but also not as poor of a model as one might expect. Our \(R^2\) values are above .65 and our RMSE is within reason.

#Step 1 - create the evaluation metrics function

eval_metrics = function(model, df, predictions, target){
    resids = df[,target] - predictions
    resids2 = resids**2
    N = length(predictions)
    r2 = as.character(round(summary(model)$r.squared, 2))
    adj_r2 = as.character(round(summary(model)$adj.r.squared, 2))
    print(c("Adjusted R^2", adj_r2)) #Adjusted R-squared
    print(c("RMSE", as.character(round(sqrt(sum(resids2)/N), 2)))) #RMSE
}

predictions = predict(linear_scaled, newdata = train)
eval_metrics(linear_scaled, train, predictions, target = 'inflation_expectation')
## [1] "Adjusted R^2" "0.66"        
## [1] "RMSE" "0.57"
# Step 3 - predicting and evaluating the model on test data
predictions = predict(linear_scaled, newdata = test)
eval_metrics(linear_scaled, test, predictions, target = 'inflation_expectation')
## [1] "Adjusted R^2" "0.66"        
## [1] "RMSE" "0.78"

\(R^2\) remains the same and RMSE increases modestly on test data. Still not wonderful but better than I would naturally expect!

# Regularization
cols_reg = cols

dummies <- dummyVars(inflation_expectation ~ ., data = data_monthly_dateless[,cols_reg])

train_dummies = predict(dummies, newdata = train[,cols_reg])

test_dummies = predict(dummies, newdata = test[,cols_reg])

print(dim(train_dummies)); print(dim(test_dummies))
## [1] 286  19
## [1] 123  19

Initally, because of the extreme correlation in some, I would expect lasso to be the technique we eventually want to resort to but I figure we might as well try ridge as well.

# ridge

x = as.matrix(train_dummies)
y_train = train$inflation_expectation

x_test = as.matrix(test_dummies)
y_test = test$inflation_expectation

lambdas <- 10^seq(4, -3, by = -.1)
ridge_reg = glmnet(x, y_train, nlambda = 25, alpha = 0, lambda = lambdas)

cv_ridge <- cv.glmnet(x, y_train, alpha = 0, lambda = lambdas)
optimal_lambda <- cv_ridge$lambda.min
optimal_lambda
## [1] 0.001
plot(ridge_reg, main = "Ridge Regression")

Optimal lambda comes out at 0.001. I’m skeptical of this value immediately, and it may require more exploration to see if there are optimizations to improve it.

#lasso
lambdas <- 10^seq(4, -3, by = -.1)

# Setting alpha = 1 implements lasso regression
lasso_reg <- cv.glmnet(x, y_train, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 5)

# Best 
lambda_best <- lasso_reg$lambda.min 
lambda_best
## [1] 0.001
lambdas <- 10^seq(2, -3, by = -.1)

# Setting alpha = 1 implements lasso regression
lasso_reg <- cv.glmnet(x, y_train, alpha = 1, lambda = lambdas, standardize = TRUE, nfolds = 5)

# Best 
lambda_best <- lasso_reg$lambda.min 
lambda_best
## [1] 0.001
plot(lasso_reg, main = "Lasso Regression")

mean(lasso_reg$lambda.1se)
## [1] 0.003162278

Taking a step back to look at what we are estimating. The expected values are between 0 and 6 effectively, for what will be the inflation in the immediate future. Really where our modeling might be letting us down is in the fact that the majority of the time, you can guess 2.5 to 3.5 % inflation and be right.

#elastic net

histogram(data_monthly_dateless$inflation_expectation, breaks = 20)

cv_5 = trainControl(method = "cv", number = 5)

inflation_elnet <- train(
  inflation_expectation ~.,data = data_monthly_dateless, 
  method = "glmnet", 
  trControl = cv_5
)

inflation_elnet
## glmnet 
## 
## 409 samples
##  19 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 327, 327, 328, 327, 327 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE       Rsquared   MAE      
##   0.10   0.0003948395  0.3732175  0.5602756  0.2690492
##   0.10   0.0039483948  0.4043822  0.4873458  0.2768239
##   0.10   0.0394839484  0.4509953  0.3678330  0.3088071
##   0.55   0.0003948395  0.3723209  0.5620107  0.2692984
##   0.55   0.0039483948  0.4031312  0.4916440  0.2750844
##   0.55   0.0394839484  0.4800511  0.2835748  0.3331703
##   1.00   0.0003948395  0.3702357  0.5667271  0.2689143
##   1.00   0.0039483948  0.3954760  0.5125016  0.2718174
##   1.00   0.0394839484  0.4962108  0.2336807  0.3458847
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.0003948395.
#similar results

inflation_elnet_best <- train(
  inflation_expectation ~ . ^ 2, data = data_monthly_dateless,
  method = "glmnet",
  trControl = cv_5,
  tuneLength = 10
)

get_best_result = function(caret_fit) {
  best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
  best_result = caret_fit$results[best, ]
  rownames(best_result) = NULL
  best_result
}

get_best_result(inflation_elnet_best)
##   alpha     lambda      RMSE Rsquared       MAE    RMSESD RsquaredSD      MAESD
## 1     1 0.00140774 0.3400945 0.649796 0.2377596 0.0615206 0.08045943 0.03120088

We end up with very similar results: because alpha is at 1 so it is practicly doing the exact same algorithm as above.

Principal Component Analysis

obs_x <- na.omit(data_monthly) %>% 
  rename(date = month)

ord <- order(obs_x$date)
obs_x <- obs_x[ord, ]
dates <- as.Date(obs_x$date)
obs_x$date <- NULL
obs_x <- sapply(obs_x, diff)

pr_out <- prcomp(obs_x, center=TRUE, scale=TRUE)
plot(cumsum(pr_out$sdev^2)/ sum(pr_out$sdev^2))
abline(h = 0.95)

#same number of rows as x, with only two most efficient columns
W <-  pr_out$x[,1:2]

plot(pr_out$sdev, main="")

Appears to be a noticable drop off at 5 actually.

# cor matrix

# correlation <-round(cor(df_test,use = "complete.obs"),2)
# correlation[upper.tri(correlation)] <- 0
# correlation
# 
# # print it out
# # library(xtable)
# # print(xtable(upper), type="html")
# 
# 
# 
# 
# # stages of the economic cycle (....)
# 
# 
# # k means
# # relatiom btw FEDFUNDS, Gold, CPI; cover any confonder
# 
# df_km <- cbind(DGS10_chg, FEDFUNDS_chg,CPIAUCS_chg,GOLDAMGBD228NLBM_chg)
# 
# df_km <- na.omit(df_km)
# 
# km_out <- kmeans(df_km, centers = 5)  #goodness of the classification k-means
# 
# output <- list()
# for(i in 1:6){
#         output[[i]] <- kmeans(coredata(df_km), i)
# }
# output  # maybe k =5
# 
# 
# # number of observation
# km_out <- kmeans(df_km, centers = 5)
# km_out$size
# km_out$centers

Engineer features based on population

Here we introduce population data from the census in order to hopefully look at per capita figures. This should hopefully find some new realizations. The US Census API actually has data for the entire world which is fascinating! So maybe we

lex <- get_idb(
  country = "all",
  year = 2021,
  variables = c("name", "e0"),
  geometry = TRUE
)

ggplot(lex, aes(fill = e0)) + 
  theme_bw() + 
  geom_sf() + 
  coord_sf(crs = 'ESRI:54030') + 
  scale_fill_viridis_c() + 
  labs(fill = "Life expectancy \nat birth (2021)")

population <- get_idb(
  country = "USA",
  year = 1978:2021,
  age = 0:100,
  sex = c("male", "female")
) %>% 
  group_by(year) %>% 
  summarise(population = sum(pop)) 

ggplot(data = population, mapping = aes(x = year, y = population)) + 
  geom_point() + 
  geom_line() + 
  scale_y_continuous("Populations") +
  scale_x_continuous()+
  labs(title = "Population of the USA", subtitle = "Nothing very surprising here")

Out of caution, this will only be applied to the yearly data as this is the most granular level we have. In the future, you could linearly estimate the data assuming an equal distribution of birthdays throughout the year and that should be close.

population$year <- as.Date(ISOdate(population$year, 1, 1))

data_per_capita <- merge(df_yearly_cleaned, population) %>% 
  mutate(car_sales_percap = total_vehicle_sales / population) %>% 
  mutate(house_price_percap = house_price_index_nationwide / population) %>% 
  mutate(s_and_p_percap = s_and_p / population) %>% 
  mutate(m_two_percap = m_two / population)

The reason to create this type of feature is actually a much more helpful way of standardizing some of these values. For example, car sales will naturally climb with more people. But if it climbs per capita that would be a good indicator that inflationary levels and expecations. House prices could be affected in a similar way.

From our initial regression, we know that m2 money was important. However, this could be inappropriate because there should be more money with a lot more people so this is helpful to make sure that truly is significant.

Conclusion

Value

Saved time? Decreased uncertainty? What did we find?

Checklist [] algorithm to explore the relationship between different features in the data - graphical summary - quantitative evaluation - engineer a feature

[] Veryify results - subset? - How robust is this? - Uncertainty? How does this affect the conclusions

[] Discussion of data dredging/snooping

[] github readme